Topic Modeling¶

Topic Modeling

Topic modeling is a machine learning technique that automatically analyzes text data to determine cluster words for a set of documents. This is known as ‘unsupervised’ machine learning because it doesn’t require a predefined list of tags or training data that’s been previously classified by humans.

I chose 3 topic modeling techniques:

  • Latent Dirichlet Allocation (LDA)
  • Latent Semantic Analysis (LSA)
  • BERTopic

I will implement and compare those topic modeling techniques on the 20 newsgroups dataset, which is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

Importing the Relevant Libraries¶

In [84]:
# Importing general libraries 
import re
import numpy as np
import pandas as pd
from pprint import pprint

# Importing the Gensim library
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# I will use this library for implementing the truncated singular value decomposition for the LSA model
from gensim.models import LsiModel 


# Importing nltk and downloading stopwords 
import nltk
nltk.download('stopwords')

# Importing spacy for lemmatization
import spacy

# Importing the BERTopic model
from bertopic import BERTopic
# Importing the sentence-transformers package for the purpose of document embeddings
from sentence_transformers import SentenceTransformer
from sentence_transformers import *
# Importing UMAP for dimensionality reduction in the BERTopic model
import umap
# Importing HDBSCAN to perform its clustering
import hdbscan

# Importing various dimensionality reduction and clustering techniques 
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


# Importing LexRank, an unsupervised approach to text summarization based on graph-based centrality scoring of sentences
from lexrank import *
# Importing the torch package  
import torch


# Importing plotting tools
import pyLDAvis
import pyLDAvis.gensim_models  
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline

# Enabling logging for gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

# Importing warnings 
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Yoni\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Preparing Stopwords¶

In [85]:
# Importing NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

Importing the 20 newsgroups DataSet¶

In [86]:
# Importing Dataset
df = pd.read_json('newsgroups.json')
print(df.target_names.unique())
df.head(10)
['rec.autos' 'comp.sys.mac.hardware' 'comp.graphics' 'sci.space'
 'talk.politics.guns' 'sci.med' 'comp.sys.ibm.pc.hardware'
 'comp.os.ms-windows.misc' 'rec.motorcycles' 'talk.religion.misc'
 'misc.forsale' 'alt.atheism' 'sci.electronics' 'comp.windows.x'
 'rec.sport.hockey' 'rec.sport.baseball' 'soc.religion.christian'
 'talk.politics.mideast' 'talk.politics.misc' 'sci.crypt']
Out[86]:
content target target_names
0 From: lerxst@wam.umd.edu (where's my thing)\nS... 7 rec.autos
1 From: guykuo@carson.u.washington.edu (Guy Kuo)... 4 comp.sys.mac.hardware
2 From: twillis@ec.ecn.purdue.edu (Thomas E Will... 4 comp.sys.mac.hardware
3 From: jgreen@amber (Joe Green)\nSubject: Re: W... 1 comp.graphics
4 From: jcm@head-cfa.harvard.edu (Jonathan McDow... 14 sci.space
5 From: dfo@vttoulu.tko.vtt.fi (Foxvog Douglas)\... 16 talk.politics.guns
6 From: bmdelane@quads.uchicago.edu (brian manni... 13 sci.med
7 From: bgrubb@dante.nmsu.edu (GRUBB)\nSubject: ... 3 comp.sys.ibm.pc.hardware
8 From: holmes7000@iscsvax.uni.edu\nSubject: WIn... 2 comp.os.ms-windows.misc
9 From: kerr@ux1.cso.uiuc.edu (Stan Kerr)\nSubje... 4 comp.sys.mac.hardware

Removing Emails and Newline Characters¶

In [87]:
# Converting to list
data = df.content.values.tolist()

# Removing Emails
data = [re.sub('\S*@\S*\s?', '', sent) for sent in data]

# Removing new line characters
data = [re.sub('\s+', ' ', sent) for sent in data]

# Removing distracting single quotes
data = [re.sub("\'", "", sent) for sent in data]

pprint(data[:1])
['From: (wheres my thing) Subject: WHAT car is this!? Nntp-Posting-Host: '
 'rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: '
 '15 I was wondering if anyone out there could enlighten me on this car I saw '
 'the other day. It was a 2-door sports car, looked to be from the late 60s/ '
 'early 70s. It was called a Bricklin. The doors were really small. In '
 'addition, the front bumper was separate from the rest of the body. This is '
 'all I know. If anyone can tellme a model name, engine specs, years of '
 'production, where this car is made, history, or whatever info you have on '
 'this funky looking car, please e-mail. Thanks, - IL ---- brought to you by '
 'your neighborhood Lerxst ---- ']

Tokenizing Words and Clean-Up Text¶

In [88]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes punctuations

data_words = list(sent_to_words(data))

print(data_words[:1])
[['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'of', 'maryland', 'college', 'park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']]

Creating Bigram and Trigram Models¶

In [89]:
# Building the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])
['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp_posting_host', 'rac_wam_umd_edu', 'organization', 'university', 'of', 'maryland_college_park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front_bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']

Removing Stopwords and Making Bigrams and Lemmatization¶

In [90]:
# Defining functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
In [91]:
# Removing Stop Words
data_words_nostops = remove_stopwords(data_words)

# Forming Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initializing spacy 'en' model, keeping only tagger component (for efficiency)
# python3 -m spacy download en
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

# Preforming lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])
[['s', 'thing', 'car', 'nntp_poste', 'host', 'park', 'line', 'wonder', 'enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'door', 'really', 'small', 'addition', 'separate', 'rest', 'body', 'know', 'tellme', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'info', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']]

Creating the Dictionary and Corpus needed for Topic Modeling¶

In [92]:
# Creating Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Creating Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# Viewing the Term Document Frequency
print(corpus[:1])
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 5), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 2), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1)]]
In [93]:
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]
Out[93]:
[[('addition', 1),
  ('body', 1),
  ('bring', 1),
  ('call', 1),
  ('car', 5),
  ('day', 1),
  ('door', 2),
  ('early', 1),
  ('engine', 1),
  ('enlighten', 1),
  ('funky', 1),
  ('history', 1),
  ('host', 1),
  ('info', 1),
  ('know', 1),
  ('late', 1),
  ('lerxst', 1),
  ('line', 1),
  ('look', 2),
  ('mail', 1),
  ('make', 1),
  ('model', 1),
  ('name', 1),
  ('neighborhood', 1),
  ('nntp_poste', 1),
  ('park', 1),
  ('production', 1),
  ('really', 1),
  ('rest', 1),
  ('s', 1),
  ('see', 1),
  ('separate', 1),
  ('small', 1),
  ('spec', 1),
  ('sport', 1),
  ('tellme', 1),
  ('thank', 1),
  ('thing', 1),
  ('wonder', 1),
  ('year', 1)]]

Building the LDA Topic Model¶

Latent Dirichlet Allocation (LDA) is a generative statistical model that explains a set of observations through unobserved groups, and each group explains why some parts of the data are similar. In this, observations (e.g., words) are collected into documents, and each word's presence is attributable to one of the document's topics. Each document will contain a small number of topics. LDA is one of the most popular topic modeling methods.

In [94]:
# Building the LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1, # Determines how often the model parameters should be updated
                                           chunksize=100, # The number of documents to be used in each training chunk
                                           passes=10, # Total number of training passes
                                           alpha='auto',
                                           per_word_topics=True)

Viewing the Topics in The LDA Model¶

In [95]:
# Printing the Keyword in the 20 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
[(0,
  '0.024*"kill" + 0.023*"live" + 0.021*"death" + 0.017*"die" + '
  '0.017*"physical" + 0.015*"center" + 0.014*"bike" + 0.014*"attack" + '
  '0.012*"israeli" + 0.012*"fire"'),
 (1,
  '0.621*"ax" + 0.018*"slow" + 0.014*"brain" + 0.014*"review" + 0.012*"mb" + '
  '0.011*"clipper_chip" + 0.010*"sc" + 0.010*"printer" + 0.009*"box" + '
  '0.008*"mouse"'),
 (2,
  '0.075*"space" + 0.063*"gun" + 0.022*"launch" + 0.021*"earth" + '
  '0.019*"firearm" + 0.017*"orbit" + 0.017*"mission" + 0.017*"series" + '
  '0.015*"vehicle" + 0.015*"year"'),
 (3,
  '0.150*"com" + 0.048*"mount" + 0.046*"apple" + 0.037*"ram" + '
  '0.026*"corporation" + 0.025*"frame" + 0.025*"task" + 0.022*"spring" + '
  '0.020*"locate" + 0.019*"spacecraft"'),
 (4,
  '0.024*"evidence" + 0.019*"believe" + 0.016*"claim" + 0.016*"reason" + '
  '0.014*"man" + 0.014*"exist" + 0.012*"sense" + 0.012*"book" + 0.012*"life" + '
  '0.011*"faith"'),
 (5,
  '0.024*"thank" + 0.024*"line" + 0.019*"program" + 0.018*"file" + '
  '0.017*"mail" + 0.017*"system" + 0.014*"card" + 0.014*"include" + '
  '0.014*"send" + 0.013*"run"'),
 (6,
  '0.322*"drive" + 0.080*"disk" + 0.054*"scsi" + 0.036*"gateway" + '
  '0.035*"motherboard" + 0.015*"bank" + 0.015*"please_respond" + '
  '0.014*"greatly_appreciate" + 0.012*"fast" + 0.012*"n"'),
 (7,
  '0.099*"nhl" + 0.070*"cop" + 0.026*"enable" + 0.025*"police" + 0.020*"plot" '
  '+ 0.018*"conservative" + 0.015*"row" + 0.014*"neat" + 0.014*"closely" + '
  '0.011*"sharp"'),
 (8,
  '0.073*"directory" + 0.061*"battery" + 0.027*"phase" + 0.019*"consult" + '
  '0.016*"sustain" + 0.014*"weeks_ago" + 0.013*"scott_roby" + 0.010*"ave" + '
  '0.009*"space_shuttle" + 0.009*"powerbook"'),
 (9,
  '0.196*"window" + 0.058*"do" + 0.056*"monitor" + 0.054*"character" + '
  '0.040*"section" + 0.039*"recommend" + 0.029*"usenet" + 0.028*"font" + '
  '0.023*"workstation" + 0.020*"laboratory"'),
 (10,
  '0.095*"season" + 0.053*"pen" + 0.044*"trade" + 0.042*"objective" + '
  '0.040*"rational" + 0.039*"star" + 0.030*"morality" + 0.030*"past" + '
  '0.027*"predict" + 0.024*"penguin"'),
 (11,
  '0.045*"soldier" + 0.042*"armenian" + 0.040*"village" + 0.037*"greek" + '
  '0.027*"turk" + 0.027*"turkish" + 0.025*"occupy" + 0.019*"terrorism" + '
  '0.017*"northern" + 0.014*"inhabitant"'),
 (12,
  '0.053*"upgrade" + 0.047*"pack" + 0.043*"library" + 0.040*"dog" + '
  '0.038*"status" + 0.034*"clock" + 0.028*"floppy" + 0.025*"electrical" + '
  '0.025*"ftp_site" + 0.025*"routine"'),
 (13,
  '0.031*"write" + 0.022*"make" + 0.021*"know" + 0.021*"say" + 0.020*"think" + '
  '0.020*"article" + 0.019*"people" + 0.015*"see" + 0.012*"thing" + '
  '0.012*"way"'),
 (14,
  '0.092*"team" + 0.087*"game" + 0.061*"play" + 0.056*"win" + 0.044*"year" + '
  '0.027*"division" + 0.023*"score" + 0.022*"wing" + 0.021*"fan" + '
  '0.019*"run"'),
 (15,
  '0.070*"state" + 0.059*"government" + 0.055*"law" + 0.037*"right" + '
  '0.022*"country" + 0.021*"protect" + 0.018*"pin" + 0.017*"crime" + '
  '0.017*"watch" + 0.016*"citizen"'),
 (16,
  '0.047*"line" + 0.044*"get" + 0.034*"go" + 0.030*"nntp_poste" + '
  '0.027*"organization" + 0.023*"host" + 0.021*"m" + 0.019*"good" + '
  '0.015*"look" + 0.014*"time"'),
 (17,
  '0.060*"key" + 0.047*"system" + 0.034*"chip" + 0.030*"bit" + '
  '0.029*"technology" + 0.023*"public" + 0.023*"phone" + 0.022*"datum" + '
  '0.021*"cpu" + 0.018*"encryption"'),
 (18,
  '0.048*"problem" + 0.035*"use" + 0.015*"talk" + 0.014*"work" + 0.014*"high" '
  '+ 0.014*"science" + 0.010*"set" + 0.010*"value" + 0.010*"current" + '
  '0.010*"reference"'),
 (19,
  '0.044*"model" + 0.040*"device" + 0.036*"wire" + 0.033*"power" + '
  '0.032*"replace" + 0.030*"bus" + 0.026*"unit" + 0.025*"internal" + '
  '0.023*"ground" + 0.022*"external"')]

Computing Model Perplexity and Coherence Score¶

In [96]:
# Computing Perplexity
print('\nPerplexity Score: ', lda_model.log_perplexity(corpus))  # a measure of how good the model is. lower the better.

# Computing Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Perplexity Score:  -13.257142263819764

Coherence Score:  0.484063757142487

Visualizing the Topics-Keywords¶

In [97]:
# Visualizing the topics using pyLDAvis package's interactive chart
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
vis
Out[97]:

Finding the Optimal Number of Topics for LDA¶

My Proposal for finding the optimal number of topics is to build many LDA models with different values of number of topics (k) and picking the one that gives the highest coherence score. Choosing that optimal ‘k’ usually offers meaningful and interpretable topics.

In [80]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=num_topics, id2word=id2word, random_state=100,
                                           update_every=1, 
                                           chunksize=100, 
                                           passes=10, 
                                           alpha='auto',
                                           per_word_topics=True)
        model_list.append(model)                                
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
        
    return model_list, coherence_values
In [81]:
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized, start=2, limit=40, step=6)
In [82]:
# Plotting the graph for the purpose of choosing the optimal number of LDA topics 
limit=40; start=2; step=6;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.title("Choosing the Optimal Number of LDA Topics Based on the Coherence Score")
plt.xlabel("Number of Topics")
plt.ylabel("Coherence Score")
plt.legend(("coherence_values"), loc='best')
plt.show()
In [108]:
# Printing the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))
Num Topics = 2  has Coherence Value of 0.5698
Num Topics = 8  has Coherence Value of 0.5081
Num Topics = 14  has Coherence Value of 0.5075
Num Topics = 20  has Coherence Value of 0.4841
Num Topics = 26  has Coherence Value of 0.4575
Num Topics = 32  has Coherence Value of 0.4496
Num Topics = 38  has Coherence Value of 0.4384

According to the coherence score graph and scores, after topic 15 there is a decrease in the coherence score. Between topic 8 and topic 15 there is no change in the coherence score. Based on that, I will choose the model with 8 topics for the purpose of optimizing the LDA model. The reason for choosing 8 topics is because when choosing a k that is too large (like 14 or 15 topics), I saw the same keywords being repeated in multiple topics.

In [110]:
# Selecting the chosen LDA model and printing the topics (8 topics)
optimal_model = model_list[1]
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))
[(0,
  '0.015*"say" + 0.014*"people" + 0.011*"write" + 0.009*"think" + 0.009*"know" '
  '+ 0.008*"make" + 0.007*"article" + 0.006*"believe" + 0.006*"see" + '
  '0.006*"come"'),
 (1,
  '0.016*"key" + 0.014*"year" + 0.013*"team" + 0.013*"game" + 0.011*"line" + '
  '0.009*"get" + 0.009*"play" + 0.009*"good" + 0.008*"go" + 0.007*"win"'),
 (2,
  '0.018*"law" + 0.014*"gun" + 0.014*"public" + 0.013*"government" + '
  '0.012*"state" + 0.010*"right" + 0.009*"system" + 0.009*"science" + '
  '0.009*"discussion" + 0.008*"case"'),
 (3,
  '0.017*"get" + 0.016*"article" + 0.016*"write" + 0.014*"line" + 0.012*"go" + '
  '0.009*"organization" + 0.009*"m" + 0.009*"car" + 0.008*"good" + '
  '0.007*"nntp_poste"'),
 (4,
  '0.018*"wire" + 0.015*"item" + 0.012*"steal" + 0.012*"clearly" + '
  '0.011*"ground" + 0.010*"lead" + 0.010*"laugh" + 0.009*"cable" + 0.008*"gay" '
  '+ 0.007*"motto"'),
 (5,
  '0.021*"line" + 0.012*"use" + 0.010*"system" + 0.010*"organization" + '
  '0.009*"nntp_poste" + 0.008*"host" + 0.008*"thank" + 0.007*"drive" + '
  '0.007*"get" + 0.007*"need"'),
 (6,
  '0.604*"ax" + 0.022*"_" + 0.019*"c" + 0.014*"pin" + 0.009*"gateway" + '
  '0.008*"rlk" + 0.008*"cx" + 0.005*"ei" + 0.005*"sy" + 0.004*"mc"'),
 (7,
  '0.047*"space" + 0.023*"dn" + 0.017*"launch" + 0.016*"earth" + '
  '0.016*"family" + 0.013*"orbit" + 0.013*"mission" + 0.011*"moon" + '
  '0.010*"satellite" + 0.009*"flight"')]

Finding the Most Representative Document for Each Topic¶

I may want to make more sense of what the topic is about. For that reason, the topic keywords may not be enough. So, to help me with understanding the topic, I would like to find the document a given topic has contributed to the most and infer the topic by reading that document.

In [112]:
# Grouping the top 5 sentences under each topic
sent_topics_sorteddf = pd.DataFrame()

sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf = pd.concat([sent_topics_sorteddf, 
                                             grp.sort_values(['Perc_Contribution'], ascending=[0]).head(1)], 
                                            axis=0)

# Resetting the Index    
sent_topics_sorteddf.reset_index(drop=True, inplace=True)

# Formatting
sent_topics_sorteddf.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text"]

# Showing the final table
sent_topics_sorteddf.head(8)
Out[112]:
Topic_Num Topic_Perc_Contrib Keywords Text
0 0 0.9964 go, say, people, get, know, think, gun, time, ... Organization: University of Illinois at Chicag...
1 1 0.9953 drive, scsi, chip, line, bit, write, speed, fa... From: (GRUBB) Subject: Re: IDE vs SCSI Organiz...
2 2 0.9873 write, line, bike, article, car, organization,... From: (Beverly M. Zalan) Subject: Re: Frequent...
3 3 0.9971 year, team, line, game, go, write, get, articl... From: (peter.r.clark..jr) Subject: Re: Flyers ...
4 4 0.9999 ax, rlk, _, ei, m, qax, rk, r, cj, bf Subject: roman.bmp 07/14 From: (Cliff) Reply-T...
5 5 0.9952 key, encryption, use, ripem, line, government,... Subject: text of White House announcement and ...
6 6 0.9967 line, space, image, program, use, work, also, ... From: (Stephen D Brener) Subject: Intensive Ja...
7 7 0.9943 write, line, article, israeli, armenian, attac... From: (Adam Shostack) Subject: Re: was:Go Hezb...

Topic Distribution Across Documents¶

And the final step for this LDA model is understanding the volume and distribution of topics to judge how widely it was discussed.

In [116]:
# The Number of Documents for Each Topic
topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()

# The Percentage of Documents for Each Topic
topic_contribution = round(topic_counts/topic_counts.sum(), 4)

# The Topic Number and Keywords
topic_num_keywords = df_topic_sents_keywords[['Dominant_Topic', 'Topic_Keywords']]

# Concatenating Column wise
df_dominant_topics = pd.concat([topic_num_keywords, topic_counts, topic_contribution], axis=1)

# Changing the Column names
df_dominant_topics.columns = ['Dominant_Topic', 'Topic_Keywords', 'Num_Documents', 'Perc_Documents']

# Showing the final table
df_dominant_topics.head()
Out[116]:
Dominant_Topic Topic_Keywords Num_Documents Perc_Documents
0 8 line, write, get, article, nntp_poste, organiz... 1316.0 0.1163
1 1 drive, scsi, chip, line, bit, write, speed, fa... 287.0 0.0254
2 8 line, write, get, article, nntp_poste, organiz... 355.0 0.0314
3 8 line, write, get, article, nntp_poste, organiz... 1329.0 0.1175
4 11 line, file, get, write, window, use, program, ... 16.0 0.0014

Latent Semantic Analysis (LSA) Topic Modeling¶

Latent Semantic Analysis (LSA) also known as Latent Semantic Index (LSI) is a natural language processing method that analyzes relationships between a set of documents and the terms contained within. It uses singular value decomposition, a mathematical technique, to scan unstructured data to find hidden relationships between terms and concepts.

All the preprocessing work done on the 20 newsgroups dataset is still valid here. So, I can continue straight to the LSA model.

Again, I can obtain the coherence score with the Gensim module. Let’s see how the coherence score is for the LSA model for a total of 20 topics (The same number of topics as I initially chose for the LDA model. For comparison purposes).
Note - LsiModel does not function with the log_preplexity for the calculation of the perplexity score the same as LDA does. So, I will drop the perplexity score and focus my attention only to the coherence score.

In [120]:
lsi = LsiModel(corpus, num_topics=20, id2word=id2word, chunksize=100)

# Computing Coherence Score
coherence_model_lsi = CoherenceModel(model=lsi, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lsi = coherence_model_lsi.get_coherence()
print('\nCoherence Score: ', coherence_lsi)
Coherence Score:  0.5474654106268115

Now, let’s see how the coherence score is for the LSA model for the range of 2 to 20 topics. The reason for doing this, as before with the LDA model, Is for choosing the optimal number of topics for getting a more “polished” topic model to understand more coherently the corpus of documents.

In [117]:
# finding the coherence score with a different number of topics
for i in range(2,21):
    lsi = LsiModel(corpus, num_topics=i, id2word=id2word)
    coherence_model = CoherenceModel(model=lsi, texts=texts, dictionary=id2word, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    print('Coherence score with {} clusters: {}'.format(i, coherence_score))
Coherence score with 2 clusters: 0.5474654106268115
Coherence score with 3 clusters: 0.5804817573064794
Coherence score with 4 clusters: 0.5601378163216414
Coherence score with 5 clusters: 0.5978660958341103
Coherence score with 6 clusters: 0.6168266601159772
Coherence score with 7 clusters: 0.5934774020945769
Coherence score with 8 clusters: 0.5181731035330599
Coherence score with 9 clusters: 0.5397277320313394
Coherence score with 10 clusters: 0.5211226597237258
Coherence score with 11 clusters: 0.5158202036881903
Coherence score with 12 clusters: 0.5357748057219601
Coherence score with 13 clusters: 0.5211371341484472
Coherence score with 14 clusters: 0.4830237564401526
Coherence score with 15 clusters: 0.463579838268446
Coherence score with 16 clusters: 0.47692082683039305
Coherence score with 17 clusters: 0.48779930823677753
Coherence score with 18 clusters: 0.4801059780572923
Coherence score with 19 clusters: 0.4843422484510121
Coherence score with 20 clusters: 0.45692628070938657

According to the coherence scores, after topic 6 there is a decrease in the coherence score. Based on that, I will choose the model with 6 topics for the purpose of optimizing the LSA model. The reason for choosing 6 topics is because choosing a ‘k’ that marks the end of a rapid growth of topic coherence usually offers meaningful and interpretable topics.

Performing SVD¶

In [121]:
# performing SVD on the bag of words with the LsiModel to extract 6 topics
lsi = LsiModel(corpus, num_topics=6, id2word=id2word)
In [122]:
# finding the 10 words with the srongest association to the derived topics
for topic_num, words in lsi.print_topics(num_words=10):
    print('Words in {}: {}.'.format(topic_num, words))
Words in 0: 1.000*"ax" + 0.001*"qax" + 0.001*"m" + 0.001*"giz" + 0.001*"ei" + 0.001*"bhj_bhj" + 0.001*"giz_giz" + 0.000*"mf" + 0.000*"tq" + 0.000*"bhj_giz".
Words in 1: 0.243*"say" + 0.199*"file" + 0.197*"go" + 0.179*"get" + 0.168*"people" + 0.166*"know" + 0.144*"make" + 0.135*"see" + 0.132*"use" + 0.129*"also".
Words in 2: 0.409*"file" + -0.336*"say" + -0.251*"go" + 0.167*"image" + 0.159*"program" + -0.159*"know" + -0.158*"people" + -0.139*"think" + -0.137*"s" + -0.136*"come".
Words in 3: -0.581*"file" + -0.331*"entry" + 0.172*"system" + -0.135*"say" + 0.123*"use" + 0.122*"available" + -0.108*"output" + 0.107*"also" + -0.093*"program" + -0.092*"gun".
Words in 4: -0.382*"image" + 0.195*"privacy" + 0.182*"internet" + -0.153*"color" + 0.139*"anonymous" + -0.138*"format" + -0.135*"say" + -0.135*"available" + -0.133*"go" + -0.131*"version".
Words in 5: -0.302*"wire" + -0.222*"entry" + 0.200*"internet" + -0.190*"wiring" + 0.181*"privacy" + -0.172*"circuit" + -0.147*"ground" + 0.141*"file" + -0.131*"outlet" + 0.128*"anonymous".

¶

BERTopic¶

BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

After preprocessing the dataset early in my project and reaching to a final lemmatized dataset (named: data_lemmatized) containing the final product words that I have been working with. Now, I want to add those words to my dataframe as a set of rows and their corresponding words. After doing that, I would like to convert does words back to sentences for the purpose of using the sentence transformer model from BERTopic.

In [9]:
# Adding new column to the dataframe (named: text_cleaned) 
# containing the different lemmatized words in each corresponding row. 
df['text_cleaned'] = data_lemmatized
In [10]:
# Function to make it back into a sentence 
def make_sentences(data,name):
    data[name]=data[name].apply(lambda x:' '.join([i+' ' for i in x]))
    # Removing double spaces if created
    data[name]=data[name].apply(lambda x:re.sub(r'\s+', ' ', x, flags=re.I))
In [11]:
# Converting all the texts back to sentences
make_sentences(df, 'text_cleaned')
In [12]:
df.head()
Out[12]:
content target target_names text_cleaned
0 From: lerxst@wam.umd.edu (where's my thing)\nS... 7 rec.autos s thing car nntp_poste host park line wonder e...
1 From: guykuo@carson.u.washington.edu (Guy Kuo)... 4 comp.sys.mac.hardware clock poll final call summary final call si cl...
2 From: twillis@ec.ecn.purdue.edu (Thomas E Will... 4 comp.sys.mac.hardware question organization purdue_university engine...
3 From: jgreen@amber (Joe Green)\nSubject: Re: W... 1 comp.graphics system division line nntp_poste host version_p...
4 From: jcm@head-cfa.harvard.edu (Jonathan McDow... 14 sci.space question organization line article pack rat wr...

Importing a Pre-Trained Model from SentenceTransformer¶

In [13]:
# Getting a model
model=SentenceTransformer('all-MiniLM-L12-v2')

Encodinng The Preprocessed Text Data¶

In [14]:
embeddings = model.encode(df['text_cleaned'])

Finding Optimal Clusters Using The K-Means Algorithm¶

K-means clustering is one of the simplest and popular unsupervised machine learning algorithms. The K-means algorithm identifies k number of centroids, and then allocates every data point to the nearest cluster, while keeping the centroids as small as possible.

In [15]:
def find_optimal_clusters(data, max_k):
    iters = range(2, max_k+1, 1)
    
    sse = []
    for k in iters:
        sse.append(MiniBatchKMeans(n_clusters=k, init_size=256, batch_size=512, random_state=20).fit(data).inertia_)
        print('Fit {} clusters'.format(k))
        
    f, ax = plt.subplots(1, 1)
    ax.plot(iters, sse, marker='o')
    ax.set_xlabel('Cluster Centers')
    ax.set_xticks(iters)
    ax.set_xticklabels(iters)
    ax.set_ylabel('SSE')
    ax.set_title('SSE by Cluster Center Plot')
In [16]:
find_optimal_clusters(embeddings, 20)
Fit 2 clusters
Fit 3 clusters
Fit 4 clusters
Fit 5 clusters
Fit 6 clusters
Fit 7 clusters
Fit 8 clusters
Fit 9 clusters
Fit 10 clusters
Fit 11 clusters
Fit 12 clusters
Fit 13 clusters
Fit 14 clusters
Fit 15 clusters
Fit 16 clusters
Fit 17 clusters
Fit 18 clusters
Fit 19 clusters
Fit 20 clusters

According to the plot, the highest fall in the SSE is from 2-3 which means that the most optimal cluster size is 2. I am going to try with both 2 and 3 clusters.

In [17]:
# Beginning with 2 clusters
clusters_2 = MiniBatchKMeans(n_clusters=2, init_size=1024, batch_size=2048, random_state=20).fit_predict(embeddings)
In [18]:
# Defining a function for the dimensionality reduction using different techniques
def plot_tsne_pca_umap(data, labels):
    max_label = max(labels)+1
    max_items = np.random.choice(range(data.shape[0]), size=3000, replace=False)
    
    reducer=umap.UMAP()
    pca = PCA(n_components=2).fit_transform(data[max_items,:])
    tsne = TSNE().fit_transform(PCA(n_components=50).fit_transform(data[max_items,:]))
    uma=reducer.fit_transform(PCA(n_components=50).fit_transform(data[max_items,:]))
    
    
    idx = np.random.choice(range(pca.shape[0]), size=320, replace=False)
    label_subset = labels[max_items]
    label_subset = [cm.hsv(i/max_label) for i in label_subset[idx]]
    
    f, ax = plt.subplots(1, 3, figsize=(14, 6))
    
    ax[0].scatter(pca[idx, 0], pca[idx, 1], c=label_subset)
    ax[0].set_title('PCA Cluster Plot')
    
    ax[1].scatter(tsne[idx, 0], tsne[idx, 1], c=label_subset)
    ax[1].set_title('TSNE Cluster Plot')
    
    ax[2].scatter(uma[idx,0],uma[idx,1],c=label_subset)
    ax[2].set_title('UMAP Cluster Plot')
    
plot_tsne_pca_umap(embeddings, clusters_2)

We can see various dimensionality reduction techniques such as UMAP from the BERTopic model, PCA and TSNE. All are plotted against 2 clusters found by the k-means algorithm. TSNE is able to differentiate the data in approximately 60 dimensions!!!

In [19]:
# Moving to 3 clusters
clusters_3 = MiniBatchKMeans(n_clusters=3, init_size=1024, batch_size=2048, random_state=20).fit_predict(embeddings)
In [20]:
plot_tsne_pca_umap(embeddings, clusters_3)

The main difference between TSNE and UMAP is the interpretation of the distance between objects or "clusters".

TSNE preserves local structure in the data.

UMAP claims to preserve both local and most of the global structure in the data. UMAP is faster than TSNE when it concern to:

  • Large number of data points
  • Number of embedding dimensions greater than 2 or 3
  • Large number of ambient dimensions in the dataset

Getting Topics Using BERTopic and SentenceTransformer Embeddings¶

In [21]:
model2 = BERTopic()
topics, probabilities = model2.fit_transform(df['text_cleaned'],embeddings)
In [22]:
# viewing how frequent certain topics are
model2.get_topic_freq().head()
Out[22]:
Topic Count
0 -1 3614
1 0 1113
2 1 544
3 2 452
4 3 414

The topic name -1 refers to all documents that did not have any topics assigned. Not all documents are forced towards a certain cluster. If no cluster could be found, then it is simply an outlier.

After generating topics and their probabilities, I can access the frequent topics that were generated.

In [23]:
model2.get_topic(0)
Out[23]:
[('team', 0.02531838599464123),
 ('game', 0.02438041078399716),
 ('player', 0.019756747153850802),
 ('play', 0.01716978232301229),
 ('season', 0.015572793356900618),
 ('hockey', 0.01301623122543666),
 ('win', 0.012775921704569709),
 ('year', 0.012687353632565911),
 ('nhl', 0.011446904475975189),
 ('score', 0.011411865794654865)]

I can infer from the keywords that the topic discussed in those documents relates to SPORTS

In [24]:
model2.get_topic(1)
Out[24]:
[('space', 0.025939969611006874),
 ('launch', 0.017670839520851994),
 ('satellite', 0.013311532375266005),
 ('orbit', 0.012928079366789044),
 ('mission', 0.012221057412043329),
 ('earth', 0.010954313253211286),
 ('moon', 0.009484047344214675),
 ('rocket', 0.009068061670905783),
 ('flight', 0.008840779966201714),
 ('spacecraft', 0.008528870856080634)]

I can infer from the keywords that the topic discussed in those documents relates to SPACE

In [25]:
model2.get_topic(2)
Out[25]:
[('car', 0.04418850328097452),
 ('engine', 0.014353478445618184),
 ('brake', 0.01240116371108509),
 ('drive', 0.011118568363411532),
 ('speed', 0.009631037912160334),
 ('tire', 0.009569073830006452),
 ('dealer', 0.00923751958500101),
 ('price', 0.009200924817820054),
 ('saturn', 0.008951214922059298),
 ('road', 0.008650953035079923)]

I can infer from the keywords that the topic discussed in those documents relates to AUTOMOBILE

In [26]:
model2.get_topic(3)
Out[26]:
[('key', 0.027696690150910704),
 ('encryption', 0.021039274687840403),
 ('entry', 0.014423183320468415),
 ('privacy', 0.014213320141352043),
 ('clipperchip', 0.01274244153028646),
 ('security', 0.01225822027176883),
 ('chip', 0.011385142823592757),
 ('clipper', 0.01099049829575111),
 ('secure', 0.010447401660497797),
 ('file', 0.0099918349958672)]

I can infer from the keywords that the topic discussed in those documents relates to NETWORK/CYBER/SECURITY

In [27]:
model2.get_topic(4)
Out[27]:
[('amp', 0.01933858776121388),
 ('audio', 0.014854640508848615),
 ('battery', 0.01454935142035318),
 ('sound', 0.013949004642053859),
 ('circuit', 0.013304362047248004),
 ('input', 0.013094917362452984),
 ('channel', 0.012716265382296114),
 ('stereo', 0.011944914740315237),
 ('output', 0.011420517863627255),
 ('voltage', 0.010665935415603046)]

I can infer from the keywords that the topic discussed in those documents relates to SOUND SYSTEM

In [31]:
model2.get_topics()
Out[31]:
{-1: [('ax', 0.020173035305477125),
  ('line', 0.005284320735989547),
  ('write', 0.004815641311499644),
  ('say', 0.004811078449112976),
  ('know', 0.004676313617227577),
  ('get', 0.004382500457664024),
  ('article', 0.004208136192195585),
  ('organization', 0.004207788605474893),
  ('nntpposte', 0.004063237701209443),
  ('people', 0.004057411787023439)],
 0: [('team', 0.02531838599464123),
  ('game', 0.02438041078399716),
  ('player', 0.019756747153850802),
  ('play', 0.01716978232301229),
  ('season', 0.015572793356900618),
  ('hockey', 0.01301623122543666),
  ('win', 0.012775921704569709),
  ('year', 0.012687353632565911),
  ('nhl', 0.011446904475975189),
  ('score', 0.011411865794654865)],
 1: [('space', 0.025939969611006874),
  ('launch', 0.017670839520851994),
  ('satellite', 0.013311532375266005),
  ('orbit', 0.012928079366789044),
  ('mission', 0.012221057412043329),
  ('earth', 0.010954313253211286),
  ('moon', 0.009484047344214675),
  ('rocket', 0.009068061670905783),
  ('flight', 0.008840779966201714),
  ('spacecraft', 0.008528870856080634)],
 2: [('car', 0.04418850328097452),
  ('engine', 0.014353478445618184),
  ('brake', 0.01240116371108509),
  ('drive', 0.011118568363411532),
  ('speed', 0.009631037912160334),
  ('tire', 0.009569073830006452),
  ('dealer', 0.00923751958500101),
  ('price', 0.009200924817820054),
  ('saturn', 0.008951214922059298),
  ('road', 0.008650953035079923)],
 3: [('key', 0.027696690150910704),
  ('encryption', 0.021039274687840403),
  ('entry', 0.014423183320468415),
  ('privacy', 0.014213320141352043),
  ('clipperchip', 0.01274244153028646),
  ('security', 0.01225822027176883),
  ('chip', 0.011385142823592757),
  ('clipper', 0.01099049829575111),
  ('secure', 0.010447401660497797),
  ('file', 0.0099918349958672)],
 4: [('amp', 0.01933858776121388),
  ('audio', 0.014854640508848615),
  ('battery', 0.01454935142035318),
  ('sound', 0.013949004642053859),
  ('circuit', 0.013304362047248004),
  ('input', 0.013094917362452984),
  ('channel', 0.012716265382296114),
  ('stereo', 0.011944914740315237),
  ('output', 0.011420517863627255),
  ('voltage', 0.010665935415603046)],
 5: [('gun', 0.03860060701032071),
  ('firearm', 0.022079495650725138),
  ('weapon', 0.01604032675937555),
  ('handgun', 0.014248186038456057),
  ('guncontrol', 0.013204935795688945),
  ('crime', 0.012231098238337441),
  ('militia', 0.010754490161962898),
  ('criminal', 0.010442234463384973),
  ('right', 0.009939366728343004),
  ('state', 0.009511172709302112)],
 6: [('israeli', 0.03498161655206188),
  ('arab', 0.02333324546867351),
  ('attack', 0.014498670170942956),
  ('palestinian', 0.013823798524660083),
  ('lebanese', 0.013076784284039937),
  ('soldier', 0.01083131119167203),
  ('civilian', 0.010717474121092729),
  ('peace', 0.010647873407253216),
  ('village', 0.010562804028671314),
  ('policyresearch', 0.010397257115494657)],
 7: [('mail', 0.05859183838406447),
  ('address', 0.039653532079916363),
  ('nntpposte', 0.028926860579300617),
  ('fax', 0.028072453990112805),
  ('line', 0.023905474118807283),
  ('thank', 0.023905446997748183),
  ('host', 0.023893325732244605),
  ('email', 0.02022815593741024),
  ('internet', 0.019417427384648196),
  ('send', 0.01690817064843392)],
 8: [('bike', 0.06496351838366486),
  ('motorcycle', 0.04161987326812391),
  ('ride', 0.033339920691596574),
  ('denizen', 0.01644598650813227),
  ('rider', 0.015606969427841228),
  ('dod', 0.014439237866022461),
  ('advice', 0.014260658573921633),
  ('recmotorcycle', 0.01101622498659433),
  ('mile', 0.010565035057966837),
  ('list', 0.010431387593701382)],
 9: [('printer', 0.11357426897096855),
  ('print', 0.05702196762028691),
  ('ink', 0.037670681513257344),
  ('bubblejet', 0.03399936646322734),
  ('deskjet', 0.030223464596539372),
  ('postscript', 0.022068532500621837),
  ('font', 0.021925969370348925),
  ('toner', 0.02190114915319171),
  ('scanner', 0.021497118112210577),
  ('laserprinter', 0.02041586291696539)],
 10: [('moral', 0.0562879140528236),
  ('morality', 0.04937240319191626),
  ('objective', 0.039358861754562526),
  ('objectivemorality', 0.02382439790274925),
  ('value', 0.01926253232906652),
  ('animal', 0.0170335317833117),
  ('specie', 0.016022288097005285),
  ('immoral', 0.015980602817449615),
  ('frankodwyer', 0.01587247291547734),
  ('objectivevalue', 0.014788685291665426)],
 11: [('sale', 0.04421519150451173),
  ('ticket', 0.03560509438240559),
  ('hotel', 0.025320778748856874),
  ('offer', 0.020792391842287098),
  ('up', 0.019649145319916045),
  ('sell', 0.019412553408750654),
  ('mail', 0.017011192683458447),
  ('line', 0.01610022869001741),
  ('nntpposte', 0.015731871940311417),
  ('host', 0.015185561743365012)],
 12: [('tap', 0.03852157841509703),
  ('government', 0.017165168354659616),
  ('police', 0.016459260752552853),
  ('key', 0.013102565060776318),
  ('trust', 0.012262961253193198),
  ('proposal', 0.010772005230080138),
  ('cop', 0.0107508962367134),
  ('good', 0.01013971052265792),
  ('clipper', 0.010081199103877435),
  ('wiretap', 0.009751336729956555)],
 13: [('polygon', 0.07664445916596928),
  ('point', 0.03893674489465752),
  ('sphere', 0.038081077837062834),
  ('plane', 0.02962244961012423),
  ('edge', 0.025271756955666896),
  ('routine', 0.024233563114594504),
  ('surface', 0.024054673537049768),
  ('algorithm', 0.022112335540845445),
  ('circle', 0.021977278212372386),
  ('intersection', 0.0209909259557251)],
 14: [('atheist', 0.04893884517223564),
  ('atheism', 0.04241221430090983),
  ('exist', 0.03061628445578153),
  ('belief', 0.019954752722363665),
  ('existence', 0.019832177993431413),
  ('argument', 0.018903613951371356),
  ('fallacy', 0.018257546718925766),
  ('religion', 0.01781803876424503),
  ('believe', 0.01705383473492208),
  ('theist', 0.015881311295293535)],
 15: [('sale', 0.02481006953286225),
  ('price', 0.023841358770878323),
  ('software', 0.020881658804993826),
  ('computer', 0.016678326700701854),
  ('manual', 0.015688282978400248),
  ('apple', 0.015344637470856386),
  ('upgrade', 0.014460154254993769),
  ('list', 0.013776007887434528),
  ('disk', 0.013642061298940215),
  ('fpu', 0.01313482395317774)],
 16: [('armenian', 0.05908545571010144),
  ('turkish', 0.03232436035834615),
  ('genocide', 0.02218832083890312),
  ('turk', 0.021737044775558313),
  ('serdarargic', 0.020108724995079718),
  ('massacre', 0.0131087001504206),
  ('escape', 0.012550743122526149),
  ('nazi', 0.012147942890944426),
  ('village', 0.011953088688120141),
  ('russian', 0.011040304622265906)],
 17: [('color', 0.08862597110490583),
  ('colormap', 0.06300349688874046),
  ('bit', 0.036821680682162496),
  ('visual', 0.03586169745192511),
  ('standardcolormap', 0.027231051529045616),
  ('depth', 0.025779459923563328),
  ('client', 0.02311747151849819),
  ('display', 0.02237733488102167),
  ('colour', 0.02097663016118978),
  ('screen', 0.019929988410128437)],
 18: [('food', 0.06346362245983232),
  ('msg', 0.053809079103557166),
  ('superstition', 0.03968808184006058),
  ('glutamate', 0.029865678433939512),
  ('taste', 0.028976187613034984),
  ('reaction', 0.02509663095018179),
  ('msgsensitivity', 0.02268231828920453),
  ('effect', 0.020898428274333546),
  ('chineserestaurant', 0.020202126349109253),
  ('study', 0.01839592003763124)],
 19: [('muslim', 0.027038665821038622),
  ('islamic', 0.02189971211644326),
  ('sex', 0.02138244676598906),
  ('rushdie', 0.021336207821779942),
  ('religion', 0.017249676003408942),
  ('gregg', 0.01660662658774099),
  ('woman', 0.016372612029085912),
  ('greggjaeger', 0.0151479230367414),
  ('depression', 0.01484043269589068),
  ('marriage', 0.014283669741626343)],
 20: [('sin', 0.03954147651400956),
  ('faith', 0.023837032523959863),
  ('prayer', 0.02337432034404498),
  ('love', 0.0233336933555494),
  ('salvation', 0.02002137922331453),
  ('god', 0.016535365656587368),
  ('commandment', 0.016302953911367178),
  ('christian', 0.013930437704047692),
  ('man', 0.012955090033372234),
  ('aid', 0.012928529000638221)],
 21: [('cx', 0.05832370768176702),
  ('sc', 0.049281712607800954),
  ('rlk', 0.04530922098345502),
  ('scx', 0.04037756188750717),
  ('format', 0.03610419888067061),
  ('sy', 0.03508745798357945),
  ('file', 0.03414332774952954),
  ('cj', 0.029244421971362275),
  ('image', 0.028151439573020434),
  ('cxs', 0.02523424370099946)],
 22: [('drive', 0.09344372848175847),
  ('boot', 0.04595409545372484),
  ('rombio', 0.03874923347123469),
  ('disk', 0.03424902237047937),
  ('harddisk', 0.03423138946956143),
  ('feature', 0.02658110759809402),
  ('controller', 0.02615418072733169),
  ('system', 0.022900671460440166),
  ('bio', 0.022485170835851626),
  ('westerndigital', 0.020802401740781223)],
 23: [('monitor', 0.11790037508292904),
  ('vga', 0.042051746246658576),
  ('vgamonitor', 0.031447296516934946),
  ('video', 0.022170960227908366),
  ('card', 0.020617569182584918),
  ('resolution', 0.020038556115396487),
  ('viewsonic', 0.018807589178834218),
  ('necfg', 0.017871128453768072),
  ('mode', 0.017082381048776945),
  ('tube', 0.01595638400144289)],
 24: [('libertarian', 0.03723633120642859),
  ('government', 0.03266469196975339),
  ('stevehendrick', 0.024651945300169297),
  ('employment', 0.013610441616640392),
  ('libertarianism', 0.01338696975840834),
  ('regulation', 0.012171652856425645),
  ('economy', 0.01196064469508917),
  ('socialism', 0.011890445807076263),
  ('welfare', 0.011011468313285774),
  ('country', 0.01098086900390882)],
 25: [('mhz', 0.07099157056321548),
  ('clock', 0.05160925998636245),
  ('processor', 0.04430575682151527),
  ('speed', 0.039090681455036164),
  ('pentium', 0.037424422958120616),
  ('cpu', 0.031423689658895755),
  ('instruction', 0.029018939111080833),
  ('performance', 0.025472325444006693),
  ('cisc', 0.024020501912967167),
  ('architecture', 0.023503914782384446)],
 26: [('church', 0.047154297017551694),
  ('pope', 0.02880190853941829),
  ('catholic', 0.026000488476967155),
  ('doctrine', 0.01887174743664526),
  ('schism', 0.01858300047898671),
  ('revelation', 0.015534341921285682),
  ('bishop', 0.01409634811141122),
  ('sin', 0.014042103033983134),
  ('schismatic', 0.01312873659161141),
  ('trinity', 0.012864148429777744)],
 27: [('nntpposte', 0.022755598467944557),
  ('host', 0.021176879195538328),
  ('edmccreary', 0.021115493697488413),
  ('robertweiss', 0.018321189768368),
  ('write', 0.017231737608341697),
  ('sequel', 0.01626017202090918),
  ('schiewer', 0.01591691245512373),
  ('rossborden', 0.01591691245512373),
  ('billconner', 0.015758619445428508),
  ('organization', 0.015378072101627381)],
 28: [('science', 0.033076526712751235),
  ('contradictory', 0.022663375356933465),
  ('universe', 0.01875060722802443),
  ('god', 0.018068516469823343),
  ('exist', 0.017663298968191594),
  ('origin', 0.017271037445979874),
  ('language', 0.013989916165823691),
  ('description', 0.013054099795900689),
  ('false', 0.012499045110776865),
  ('say', 0.012314570332154634)],
 29: [('survivor', 0.06208928932374977),
  ('dividianranch', 0.06092009314413341),
  ('atfburn', 0.055138755384206505),
  ('fire', 0.033817185835534475),
  ('atf', 0.024466796691918997),
  ('stove', 0.02281301663399489),
  ('napalm', 0.022352027414002062),
  ('woodstove', 0.019934146508674873),
  ('never', 0.019261053863044324),
  ('insideignite', 0.018709015639161566)],
 30: [('fire', 0.028961967979713576),
  ('compound', 0.02719764182424098),
  ('scottroby', 0.024573529737630752),
  ('child', 0.022306471874888475),
  ('tearga', 0.018088590972270738),
  ('murdersalmost', 0.017470221562138167),
  ('batf', 0.01686957221810202),
  ('koresh', 0.016822674139829536),
  ('agent', 0.013025369353435371),
  ('affair', 0.012771519232819505)],
 31: [('scsi', 0.21369832339677117),
  ('ide', 0.04499483387543342),
  ('mb', 0.043485931106553175),
  ('drive', 0.04244432841003108),
  ('device', 0.032370708308683535),
  ('esdi', 0.02844947732550255),
  ('fast', 0.0280105910197716),
  ('interface', 0.0270831224759238),
  ('pc', 0.024651399926943367),
  ('transfer', 0.02269526284693792)],
 32: [('rock', 0.04170414651532044),
  ('kid', 0.03432825336725548),
  ('warning', 0.030270818333784644),
  ('overpass', 0.028273063823245258),
  ('car', 0.026368547139087272),
  ('teenager', 0.022725033585320197),
  ('read', 0.017638208635995623),
  ('kill', 0.015844630362782528),
  ('bridge', 0.015414211276658917),
  ('keywordsbrick', 0.01522975379895767)],
 33: [('insurance', 0.0764419593911723),
  ('fault', 0.031912836910183394),
  ('deductible', 0.02941388019849342),
  ('car', 0.028262795427504157),
  ('pay', 0.025823352308636256),
  ('accident', 0.023572205594853953),
  ('rate', 0.02236804343299877),
  ('sticker', 0.021661316738831073),
  ('company', 0.01923565270501552),
  ('farm', 0.01845286404333179)],
 34: [('card', 0.12265783967540461),
  ('color', 0.029462904250679323),
  ('vram', 0.027648078694579366),
  ('fast', 0.02615799937223588),
  ('bit', 0.025639851222644262),
  ('graphic', 0.024929280624347194),
  ('video', 0.023467914529196582),
  ('orchid', 0.021502900244684556),
  ('monitor', 0.02088095686756272),
  ('performance', 0.020741545503941257)],
 35: [('gordonbank', 0.06075334932347258),
  ('jxpskepticism', 0.03804024234550143),
  ('shameful', 0.0377487688242111),
  ('transplant', 0.03771163897182705),
  ('liver', 0.03771163897182705),
  ('intellect', 0.037196092832077685),
  ('surrender', 0.03419647034628476),
  ('pain', 0.032307976060557235),
  ('computerscience', 0.027388187021648438),
  ('soon', 0.02621621299610939)],
 36: [('countersteere', 0.0546877433841349),
  ('bike', 0.04877682032873622),
  ('rider', 0.038299126675393064),
  ('motorcycle', 0.03067092057082276),
  ('rein', 0.029006019772890773),
  ('steer', 0.028805738148888215),
  ('lean', 0.02791722279023156),
  ('turn', 0.026180429004502258),
  ('swerve', 0.02394755006775863),
  ('technique', 0.02363282505561421)],
 37: [('post', 0.019914128137724876),
  ('funny', 0.01956202537267302),
  ('article', 0.01176354021736683),
  ('joke', 0.011477582915284284),
  ('write', 0.01022437894740424),
  ('pray', 0.010223770705213018),
  ('day', 0.009934511768020602),
  ('lord', 0.009747242912610272),
  ('opinion', 0.009578837151833742),
  ('naive', 0.009268271712870679)],
 38: [('yeast', 0.0450882884467131),
  ('nystatin', 0.029603291817926487),
  ('sinus', 0.0276589295888865),
  ('infection', 0.026219878231877793),
  ('treatment', 0.026079565964799923),
  ('antibiotic', 0.022706408336135315),
  ('symptom', 0.022573598682208276),
  ('oily', 0.021939991749768854),
  ('acne', 0.020484623818328385),
  ('quack', 0.01946874304104085)],
 39: [('cruel', 0.05441532740739915),
  ('deathpenalty', 0.044730854591394247),
  ('innocent', 0.042274675563293125),
  ('murder', 0.037619971493148996),
  ('kill', 0.03392094559051295),
  ('punishment', 0.03184924710187425),
  ('politicalatheist', 0.031399768119205404),
  ('commit', 0.022521894329948895),
  ('system', 0.02128501542026509),
  ('execute', 0.021268459329452434)],
 40: [('greek', 0.08020641653573837),
  ('turkish', 0.03369632413144676),
  ('turk', 0.03269440476658686),
  ('greece', 0.028928514657449222),
  ('turkishminority', 0.017408874429658126),
  ('ethnic', 0.014070776009280528),
  ('government', 0.013156072280788028),
  ('minority', 0.012286470353335341),
  ('armenian', 0.011732204574380851),
  ('book', 0.010918609256852186)],
 41: [('christian', 0.028394179464941192),
  ('liarlunatic', 0.020646137204369155),
  ('liar', 0.018112863391980438),
  ('die', 0.017753213265782383),
  ('religion', 0.017271035186176904),
  ('people', 0.016419194728729356),
  ('prophecy', 0.016180596459817377),
  ('heal', 0.014959658796960899),
  ('christianity', 0.014270679449012029),
  ('bible', 0.013657745636262403)],
 42: [('science', 0.05410192624260946),
  ('methodology', 0.038157365731421235),
  ('scientific', 0.028327819447902473),
  ('hypothesis', 0.02383097150484523),
  ('theory', 0.02219755626013905),
  ('experiment', 0.02002323269457525),
  ('sequence', 0.018712247402045572),
  ('fantasy', 0.01838389230373775),
  ('homeopathytradition', 0.017121392122000326),
  ('protein', 0.016970670662207647)],
 43: [('keyboard', 0.07167842682785401),
  ('key', 0.06853053938161754),
  ('accelerator', 0.059373708257558035),
  ('shift', 0.03731161727041526),
  ('modifier', 0.03168700185618704),
  ('ctrl', 0.027729619834106985),
  ('translation', 0.027684760231644637),
  ('ctrlkey', 0.026232417694033425),
  ('define', 0.02220227625785368),
  ('menu', 0.021250011032195728)],
 44: [('msmyer', 0.030126179171653535),
  ('president', 0.02858272403472137),
  ('job', 0.02243011759524676),
  ('work', 0.014895331907671732),
  ('russian', 0.014188154049630732),
  ('senioradministration', 0.014040473766190413),
  ('go', 0.013976089056400841),
  ('think', 0.013275827602901066),
  ('package', 0.013017902260384111),
  ('official', 0.012623735448998677)],
 45: [('absolute', 0.055735603946341944),
  ('truth', 0.04927651335520828),
  ('arrogance', 0.03242650672704986),
  ('belief', 0.031229075530699573),
  ('arrogant', 0.02572439095961214),
  ('believe', 0.023397014686843617),
  ('authority', 0.02287105825546018),
  ('scripture', 0.021087703793337746),
  ('absolutetruth', 0.020851625308633732),
  ('evidence', 0.01896131869983727)],
 46: [('drug', 0.14797412998928966),
  ('legalization', 0.031063865034182004),
  ('legalize', 0.03079586791079164),
  ('war', 0.024373591850061393),
  ('cocaine', 0.02319757444457153),
  ('wod', 0.022912426135865943),
  ('hypocrisyt', 0.022593196407818015),
  ('cigarette', 0.021073318320382468),
  ('ryanscharfy', 0.02085791894702093),
  ('legal', 0.019062508475203185)],
 47: [('oil', 0.1298915250164159),
  ('changingoil', 0.051355660417275885),
  ('bolt', 0.037605938997090756),
  ('self', 0.02675426684923848),
  ('quart', 0.025098667604793946),
  ('car', 0.02461801024730351),
  ('wrench', 0.02449879960884285),
  ('mile', 0.024309211904860725),
  ('hole', 0.022425222379998385),
  ('cylinder', 0.02225534374119694)],
 48: [('homosexual', 0.04992373870889151),
  ('gay', 0.049510865215272606),
  ('man', 0.04275225543190616),
  ('promiscuous', 0.03971101786225638),
  ('dramatically', 0.03898059444397359),
  ('percent', 0.03716988805769234),
  ('sexualpartner', 0.036165596010046534),
  ('gaypercentage', 0.03364370542019478),
  ('kinseyreport', 0.031811951897593345),
  ('study', 0.030918351109309315)],
 49: [('gateway', 0.03191204922339655),
  ('host', 0.029304789327306065),
  ('nntpposte', 0.02860578960236699),
  ('instal', 0.025599369309447367),
  ('problem', 0.025259484281414432),
  ('register', 0.022907850370720413),
  ('exception', 0.02242729809727298),
  ('erme', 0.021117780923349424),
  ('syst', 0.021117780923349424),
  ('buy', 0.020376923183803512)],
 50: [('image', 0.04820218617393005),
  ('graphic', 0.030770170694722796),
  ('plot', 0.028907678282998957),
  ('plplot', 0.028301876072960665),
  ('package', 0.02494842406231188),
  ('tool', 0.023684549036871522),
  ('library', 0.01940251387718215),
  ('analysis', 0.01759715117867585),
  ('user', 0.017024091142944522),
  ('cad', 0.01658352540038222)],
 51: [('simms', 0.09951154755118824),
  ('simm', 0.07266740424919722),
  ('memory', 0.06519869038972927),
  ('ram', 0.05024463905132314),
  ('chip', 0.03289808623771293),
  ('dram', 0.030797289918589883),
  ('refresh', 0.029527133112702397),
  ('meg', 0.027543335994726213),
  ('pinsimms', 0.024864540744881664),
  ('cycle', 0.02240759020291665)],
 52: [('medicine', 0.032484360396296416),
  ('psychoactive', 0.03148644825088931),
  ('prozac', 0.03148644825088931),
  ('disease', 0.0298138270149792),
  ('patient', 0.028647930533805828),
  ('effect', 0.02807983170126024),
  ('drug', 0.027244423603769878),
  ('placebo', 0.026634070236029473),
  ('gr', 0.02392555784869406),
  ('ronroth', 0.02361483618816698)],
 53: [('crosslinke', 0.09103812154657334),
  ('allocationunit', 0.0762407154427419),
  ('window', 0.0501502064489674),
  ('cfg', 0.043896714543846846),
  ('gfxvpic', 0.043896714543846846),
  ('cluster', 0.03520390471102671),
  ('exe', 0.03195647346637822),
  ('crash', 0.02943622943163868),
  ('keepscrashe', 0.027670297416741537),
  ('file', 0.02620222294963896)],
 54: [('widget', 0.023471939484907892),
  ('available', 0.019515194089830674),
  ('server', 0.016024083445308947),
  ('pub', 0.015576587152263783),
  ('application', 0.014169811390045937),
  ('version', 0.013075610505694363),
  ('include', 0.012838382673676605),
  ('file', 0.01198965835660298),
  ('graphic', 0.01154331465007485),
  ('resource', 0.010963987745404517)],
 55: [('hell', 0.053668898828743775),
  ('atheist', 0.03254297234343234),
  ('eternal', 0.02702139940619704),
  ('eternaldeath', 0.026692348569234574),
  ('believe', 0.02456989015379998),
  ('die', 0.021859269756740133),
  ('resurrection', 0.01752814454652577),
  ('body', 0.016582288188248105),
  ('human', 0.014571867421844086),
  ('death', 0.014529554382306941)],
 56: [('trial', 0.06933193793065276),
  ('cooper', 0.03981411372892238),
  ('witness', 0.038398387794035094),
  ('weaver', 0.03425271512538571),
  ('verdict', 0.02684055124228059),
  ('plaintiff', 0.024259144848878783),
  ('spence', 0.02379864135436258),
  ('new', 0.0225954340312973),
  ('jury', 0.021438631137387015),
  ('court', 0.020466096177050152)],
 57: [('coolingtower', 0.07288291755418987),
  ('water', 0.0614437688597253),
  ('plant', 0.05814777816846494),
  ('steam', 0.05406956241738616),
  ('uranium', 0.04376031214404917),
  ('cool', 0.03553702423484041),
  ('nuclear', 0.03358308971889222),
  ('reactor', 0.03270638832896486),
  ('nuclearsite', 0.030149828468382135),
  ('energy', 0.029265507822705518)],
 58: [('game', 0.08557625873482862),
  ('segagenesis', 0.04222881997500058),
  ('genesis', 0.039103131942929664),
  ('sale', 0.038643576555622006),
  ('controller', 0.028923834170049845),
  ('trade', 0.028495643348901954),
  ('sne', 0.02562278751332727),
  ('nintendo', 0.023115580790410272),
  ('docsdisk', 0.02268380585287087),
  ('super', 0.0219612971693762)],
 59: [('radardetector', 0.13409420428748786),
  ('detector', 0.0932187533897058),
  ('radar', 0.08863829691361524),
  ('beam', 0.03552994932171792),
  ('car', 0.029970765672937005),
  ('detect', 0.02886151305258383),
  ('police', 0.026993025698523552),
  ('speedometer', 0.02538816804091439),
  ('receiver', 0.024909038491534724),
  ('radio', 0.023712380038912465)],
 60: [('mswindow', 0.06492447422990312),
  ('window', 0.04743095081040116),
  ('icon', 0.0442483869217622),
  ('manager', 0.02760165168014481),
  ('cursor', 0.023475291523754566),
  ('delete', 0.022915603971568433),
  ('group', 0.020712085401661),
  ('program', 0.02047556753127674),
  ('finetune', 0.018691456022765594),
  ('version', 0.01724501878966719)],
 61: [('nazi', 0.052317048851381255),
  ('hitler', 0.03804763311521923),
  ('german', 0.025630914364762),
  ('limbaugh', 0.025047608747233184),
  ('party', 0.016446732327856383),
  ('chancellor', 0.014967845385512947),
  ('homosexual', 0.013368280441032816),
  ('history', 0.01331805065261802),
  ('side', 0.012989095524758307),
  ('himmler', 0.010485524446881665)],
 62: [('mouse', 0.23851835429973228),
  ('com', 0.06156532114019581),
  ('movesmoothly', 0.0417332228791408),
  ('driver', 0.03519122166834939),
  ('jump', 0.025469722063506575),
  ('mousejump', 0.023054509302947848),
  ('verticalmotion', 0.023054509302947848),
  ('horizontalmotion', 0.023054509302947848),
  ('apple', 0.022216549892712024),
  ('click', 0.021025141500104687)],
 63: [('compile', 0.0796015226439554),
  ('libxmu', 0.05489135720333669),
  ('symbol', 0.048397410910518315),
  ('error', 0.04784893378983467),
  ('explorationproduct', 0.041276765753900206),
  ('suno', 0.03695305611357901),
  ('sug', 0.03484990110252485),
  ('makefile', 0.032599081850764475),
  ('undefineddoug', 0.032304496392432436),
  ('problem', 0.030542806810446077)],
 64: [('worship', 0.06839601820241566),
  ('sabbath', 0.059975806211859446),
  ('law', 0.05790627125061493),
  ('gentile', 0.041623743449504834),
  ('day', 0.03546778747298005),
  ('ceremonial', 0.030572728169981418),
  ('christian', 0.030041613802848123),
  ('sabbathadmission', 0.022182511923863574),
  ('paul', 0.021989876785142658),
  ('jewish', 0.02079214569943989)],
 65: [('tape', 0.10519664310014572),
  ('disk', 0.07546218013214138),
  ('drive', 0.04996700444087918),
  ('backup', 0.043972820289968816),
  ('floptical', 0.03644681802474545),
  ('hole', 0.02677339478686914),
  ('floppy', 0.02501168138865748),
  ('nilaypatel', 0.024045612379201022),
  ('marker', 0.021891861237246055),
  ('optical', 0.020431414543603546)],
 66: [('modem', 0.16014432693622221),
  ('baud', 0.036664227757766316),
  ('fax', 0.03487527990342778),
  ('string', 0.03474273061031691),
  ('firstclass', 0.0289159128547647),
  ('robotic', 0.028742977572820995),
  ('setting', 0.027307119651124166),
  ('duo', 0.026040527868362136),
  ('cable', 0.02246651266204908),
  ('warranty', 0.021998128402039536)],
 67: [('seizure', 0.12954213075170964),
  ('corn', 0.10896733846046479),
  ('cereal', 0.07769878444433499),
  ('food', 0.0381151164874379),
  ('diet', 0.03672753242847102),
  ('relatedseizure', 0.03411954898348687),
  ('infantilespasm', 0.031183827588471737),
  ('kellog', 0.028192241361637397),
  ('disorder', 0.025854451841505456),
  ('sugarcoate', 0.025137758295453703)],
 68: [('resurrection', 0.04569956976703852),
  ('rise', 0.04317784183430807),
  ('impact', 0.023389371446158578),
  ('jewish', 0.022445014345734402),
  ('body', 0.021458645041361177),
  ('roman', 0.021366574334352312),
  ('lie', 0.018037757406375154),
  ('emery', 0.017151377152668432),
  ('believe', 0.016447551751553324),
  ('lukesaccount', 0.016020186238275363)],
 69: [('helmet', 0.2192452140550481),
  ('shoei', 0.03689432359065039),
  ('liner', 0.03250449907851934),
  ('bike', 0.029274771898348848),
  ('impact', 0.02899516534932512),
  ('passenger', 0.027333387810211055),
  ('primaryconcern', 0.026278840408206876),
  ('damage', 0.0261579014466072),
  ('seat', 0.02612500438594434),
  ('size', 0.0258145348932906)],
 70: [('sale', 0.06236273733803827),
  ('disk', 0.046820084166751125),
  ('drive', 0.03426508786611373),
  ('apple', 0.03218318382169636),
  ('include', 0.030190813564505516),
  ('manual', 0.029368798055769568),
  ('rodneyjack', 0.028336901725759197),
  ('dbase', 0.027738696948492323),
  ('card', 0.025791692577726268),
  ('commodore', 0.025647699109219543)],
 71: [('lens', 0.08970656448815056),
  ('camera', 0.08757099533925322),
  ('projector', 0.07271553813203718),
  ('lense', 0.061320954981221525),
  ('sale', 0.038243402868646956),
  ('sell', 0.03435626675112036),
  ('zoom', 0.0331572768913827),
  ('price', 0.032168999079498446),
  ('strap', 0.031601162454873835),
  ('video', 0.02866195944839994)],
 72: [('monitor', 0.06605113540711696),
  ('color', 0.057121203968264825),
  ('screen', 0.053194548816159724),
  ('video', 0.049385537724491516),
  ('problem', 0.047896995238572854),
  ('apple', 0.037616708363021716),
  ('window', 0.03540509600676232),
  ('scrolling', 0.02867528112504405),
  ('accummulate', 0.025932596354843414),
  ('horizontal', 0.024963351802307934)],
 73: [('claytoncramer', 0.05023087531397783),
  ('homosexual', 0.03968014597556204),
  ('sexualorientation', 0.03954729331761143),
  ('gay', 0.03606032721252935),
  ('optilinkcramer', 0.028666050348415302),
  ('rape', 0.026487517236246802),
  ('professor', 0.026108567481876125),
  ('sexual', 0.025376079310436686),
  ('female', 0.025196972268290863),
  ('minerelation', 0.024837564301247683)],
 74: [('tiff', 0.09237370610575327),
  ('tiffphilosophical', 0.06058137909392974),
  ('significance', 0.04793757090004193),
  ('douglasadam', 0.024251000463952538),
  ('spec', 0.022941550480138122),
  ('gripe', 0.020018045720321037),
  ('alice', 0.019541611974000447),
  ('tully', 0.016725279836805617),
  ('philosophicalsignificance', 0.016725279836805617),
  ('question', 0.016303598351641264)],
 75: [('phone', 0.10210800190203267),
  ('number', 0.09404855457129553),
  ('ozone', 0.0653247791923949),
  ('dial', 0.058929311336368556),
  ('jackmounte', 0.05384082732072073),
  ('greetingssituation', 0.05384082732072073),
  ('operator', 0.0495486430997516),
  ('find', 0.03630331664033296),
  ('line', 0.03338570327010134),
  ('trace', 0.0332648414068542)],
 76: [('openwindow', 0.03802976048693809),
  ('window', 0.03736796236953475),
  ('problem', 0.032794036795374445),
  ('uart', 0.03016031775935999),
  ('server', 0.029883556853598416),
  ('com', 0.028733509371414327),
  ('port', 0.02258604944013815),
  ('card', 0.021075656380948573),
  ('run', 0.020971288012049002),
  ('patch', 0.01935440151817624)],
 77: [('window', 0.13197824497757174),
  ('windowmanag', 0.11056598151210643),
  ('position', 0.06993633627496178),
  ('decoration', 0.056247722081152064),
  ('selepntr', 0.05428177414211533),
  ('specificcoordinate', 0.04648204310147628),
  ('specify', 0.04255225362670293),
  ('tomlastrange', 0.03718563448118103),
  ('sibling', 0.03718563448118103),
  ('tobiasdope', 0.034245891266888644)],
 78: [('marriage', 0.13011600368378642),
  ('marry', 0.10224895746902628),
  ('married', 0.06032505879775998),
  ('ceremony', 0.04894646352438059),
  ('divorce', 0.037957548995177344),
  ('wedding', 0.03544753372540982),
  ('commitment', 0.03454046130980365),
  ('church', 0.030893054566486947),
  ('couple', 0.029120207262619765),
  ('priest', 0.02461147698747508)],
 79: [('widget', 0.1455654734442187),
  ('gl', 0.10612163196220521),
  ('gadget', 0.042363728689633305),
  ('xmdrawingarea', 0.041117599711889316),
  ('application', 0.03903441883484099),
  ('circular', 0.03642390772606996),
  ('glxmdraw', 0.0360037114854424),
  ('motif', 0.03503352035645648),
  ('athenawidget', 0.031169663805091858),
  ('ibmrs', 0.028821118456155828)],
 80: [('mormon', 0.07755856950436148),
  ('religion', 0.021300610235883883),
  ('secularauthoritie', 0.02061833682713054),
  ('ld', 0.019342948414348915),
  ('church', 0.018270086033943894),
  ('casperknie', 0.018253375022232024),
  ('sect', 0.017188112268386214),
  ('persecution', 0.015623764452659741),
  ('peteyadlowsky', 0.015500682431069387),
  ('rld', 0.013371512786196155)],
 81: [('driver', 0.24607063821299033),
  ('videocard', 0.07190173992678746),
  ('card', 0.06487287754592932),
  ('color', 0.06292496266022757),
  ('wong', 0.05320813884102648),
  ('dualpage', 0.04445593861384696),
  ('wak', 0.04445593861384696),
  ('ftpsite', 0.04365728718791773),
  ('window', 0.04277683153896209),
  ('speedstar', 0.042658650636166244)],
 82: [('date', 0.055123551967192076),
  ('timer', 0.05323847922821745),
  ('timing', 0.04538381666131422),
  ('snow', 0.03822889705131632),
  ('menu', 0.03575517057251557),
  ('pellet', 0.035279842686016215),
  ('ultra', 0.033243778817137686),
  ('battery', 0.03167888338815815),
  ('clock', 0.030856566838408196),
  ('crystal', 0.0271672116894724)],
 83: [('cop', 0.06575911792088596),
  ('ticket', 0.04697413018421206),
  ('intoxicated', 0.03247070953500257),
  ('speedymercer', 0.029662652921589202),
  ('liquor', 0.028862852920002287),
  ('officer', 0.02848120333080181),
  ('court', 0.0284364324913694),
  ('dwi', 0.022358200984169373),
  ('drunkdrive', 0.021982857725844508),
  ('speed', 0.021567350906782023)],
 84: [('coolant', 0.04808196424024301),
  ('heat', 0.03984173637128635),
  ('substitute', 0.039684509533410864),
  ('airconditione', 0.03769983189520441),
  ('freon', 0.029951394651356156),
  ('oven', 0.028940941272821884),
  ('pump', 0.02645389371224814),
  ('peltiereffect', 0.024818885014908143),
  ('retrofit', 0.023841142886180604),
  ('air', 0.02356229725082875)],
 85: [('deficit', 0.06589492600586329),
  ('tax', 0.0653150467867),
  ('vat', 0.05879470504369579),
  ('taxis', 0.048375017779792535),
  ('capitalgain', 0.031164526247537707),
  ('economic', 0.02851698363387834),
  ('investor', 0.025617006546753633),
  ('spending', 0.02539515440180188),
  ('revenue', 0.022676799611918194),
  ('rate', 0.02226519874314681)],
 86: [('ax', 0.05928273684496523),
  ('cj', 0.035578788558542095),
  ('rk', 0.03064514345397361),
  ('sj', 0.022996362074659098),
  ('rlk', 0.0213248489623028),
  ('lhz', 0.02124579702943873),
  ('cx', 0.02026565562133745),
  ('japanese', 0.01922259969654832),
  ('ai', 0.016100430090612076),
  ('vz', 0.015020523831344243)],
 87: [('henrik', 0.05555200374031482),
  ('plane', 0.049832710685590365),
  ('turkishplane', 0.04866347231674578),
  ('armenian', 0.04654495675582742),
  ('azeris', 0.041031409683375215),
  ('shoot', 0.03538534292167865),
  ('homeland', 0.03304445385150035),
  ('forge', 0.03156661752883352),
  ('search', 0.0312195600188601),
  ('turkish', 0.030933303825198055)],
 88: [('jazz', 0.06913934342690123),
  ('sale', 0.06888077579873401),
  ('rollingstone', 0.06307043023062767),
  ('rpmsingle', 0.05211372497053604),
  ('vinyl', 0.044949309539128464),
  ('capitolpicture', 0.0406882092623587),
  ('music', 0.038485348487042374),
  ('sleeve', 0.0377001008006893),
  ('promopicture', 0.03475353526364703),
  ('cd', 0.03383899352105777)],
 89: [('motto', 0.12238060257950377),
  ('pompousass', 0.04237904933509004),
  ('thing', 0.03296994081762076),
  ('little', 0.0277923402145253),
  ('change', 0.026147577786321718),
  ('populationgrowth', 0.023878882099984296),
  ('coin', 0.023258447621310026),
  ('farzinmokhtarian', 0.021680482378181522),
  ('schneider', 0.021112941032553564),
  ('freedom', 0.020915221618223692)],
 90: [('dog', 0.22084225914840827),
  ('chase', 0.0404788849679357),
  ('bike', 0.03859342768956505),
  ('ride', 0.029784723797756978),
  ('driveway', 0.029375218086413572),
  ('road', 0.019816632653833644),
  ('encounter', 0.0197278704014323),
  ('territory', 0.018900135406056087),
  ('attack', 0.018786511578655164),
  ('dispense', 0.018596945557767572)],
 91: [('adjective', 0.03799328913126648),
  ('white', 0.03771218538748583),
  ('black', 0.03756651516176531),
  ('whitemale', 0.032935839932292446),
  ('redneck', 0.025335261486378803),
  ('africanamerican', 0.023222658987323806),
  ('male', 0.022030121233995184),
  ('loser', 0.021977092972418153),
  ('large', 0.020565000774837194),
  ('rodneyke', 0.01798526458465361)],
 92: [('godshape', 0.055344082298722036),
  ('heart', 0.03910434523124264),
  ('christianity', 0.03181851405658532),
  ('hole', 0.031404109370420576),
  ('peoplesspiritual', 0.02703776507125993),
  ('life', 0.026915970212135914),
  ('atheist', 0.02556230726124682),
  ('infectious', 0.024110662739863568),
  ('drug', 0.02366517854553703),
  ('christian', 0.022213638802563697)],
 93: [('pop', 0.08461398597439444),
  ('popup', 0.0635093995176973),
  ('dialogbox', 0.05973571503071978),
  ('button', 0.05691262623823576),
  ('window', 0.053209793998622266),
  ('dialog', 0.048765005087910644),
  ('event', 0.03381516898137644),
  ('time', 0.03266626590104498),
  ('application', 0.03179351876098632),
  ('program', 0.03151406054730094)],
 94: [('meat', 0.06621180864303135),
  ('smoke', 0.06519452498353083),
  ('carcinogenic', 0.059866362817070064),
  ('barbecuedfood', 0.05422362115449456),
  ('healthrisk', 0.05275141661921509),
  ('charcoal', 0.04152590713153676),
  ('barbecue', 0.04108247350208514),
  ('wood', 0.03487352132203682),
  ('food', 0.03416374035544622),
  ('carcinogen', 0.03230720445555953)],
 95: [('font', 0.1946849495098767),
  ('alavi', 0.05987634014930445),
  ('character', 0.05983101735600206),
  ('window', 0.05894972757941979),
  ('xterm', 0.035833034671444844),
  ('spacify', 0.03446081147583439),
  ('change', 0.030624038177386334),
  ('disappear', 0.029007190816133253),
  ('text', 0.02835149601784968),
  ('trivial', 0.027783461385210026)],
 96: [('homosexuality', 0.06874318355645229),
  ('gay', 0.06462327650350864),
  ('homosexual', 0.04455533205055673),
  ('sex', 0.027083092817949173),
  ('sin', 0.020229312449639485),
  ('people', 0.016284034304038946),
  ('lesbian', 0.015093933339820913),
  ('community', 0.01338111412229549),
  ('christian', 0.013262660113294683),
  ('church', 0.01298454697426851)],
 97: [('joystick', 0.13261781512706838),
  ('int', 0.0650652983391651),
  ('arcadestyle', 0.052517114905709227),
  ('game', 0.046989680849580016),
  ('joystickport', 0.040846644926662734),
  ('gamecard', 0.032153743717699225),
  ('button', 0.031049923773138942),
  ('read', 0.029566369506340045),
  ('augment', 0.028393304043955427),
  ('atari', 0.025096310212839187)],
 98: [('doctor', 0.10561250490212033),
  ('ultrasound', 0.0838195449880376),
  ('radiologist', 0.07849299820762998),
  ('clinic', 0.04425537370213476),
  ('apology', 0.029955408950887736),
  ('patient', 0.02837734795497413),
  ('prostate', 0.026330939175571368),
  ('medical', 0.025896871186714684),
  ('wife', 0.023653927870514586),
  ('receptionist', 0.02314648043530875)],
 99: [('bus', 0.12729742916987216),
  ('idecontroller', 0.07083387786031094),
  ('mhz', 0.07080966137616336),
  ('speed', 0.06859442502958821),
  ('localbus', 0.062238432330111164),
  ('controller', 0.054084506859578836),
  ('slow', 0.04111529566302196),
  ('ram', 0.03693448843821766),
  ('memory', 0.0317688263248472),
  ('card', 0.030915213959103988)],
 100: [('translation', 0.021866289024248125),
  ('hebrew', 0.01881806381398455),
  ('greek', 0.01875415359910381),
  ('early', 0.017411188040827846),
  ('text', 0.01731113668209478),
  ('hang', 0.016763728354137388),
  ('word', 0.015064694764207702),
  ('inerrant', 0.014956317581484383),
  ('book', 0.014731040975354161),
  ('language', 0.014718604778796069)],
 101: [('newsgateway', 0.14940457215551722),
  ('utexas', 0.13188503350382513),
  ('prolineinternet', 0.1130531614280177),
  ('uucpuunet', 0.07936720471405528),
  ('trinomial', 0.06351728937375349),
  ('mail', 0.053146258432030656),
  ('atm', 0.04980152405183907),
  ('host', 0.04379320064054297),
  ('prcgs', 0.038686232980218456),
  ('nntpposte', 0.03861165000587071)],
 102: [('driver', 0.06469877895836006),
  ('card', 0.0641229020829706),
  ('protectionfault', 0.05119318299431993),
  ('atiultra', 0.04816409330151821),
  ('window', 0.043518626305533686),
  ('gateway', 0.04320233840638738),
  ('gatewaydx', 0.04310059577960423),
  ('flex', 0.04133139596981568),
  ('experiencedfaint', 0.035389171581685086),
  ('atis', 0.035389171581685086)],
 103: [('order', 0.08831709994874071),
  ('orientaltemplar', 0.0553271917284988),
  ('rosicrucianord', 0.05257236007042261),
  ('ancient', 0.05036543952658032),
  ('tonyalicea', 0.04596305052181607),
  ('orientis', 0.035964127316645805),
  ('reuss', 0.03541488998207717),
  ('goldendawn', 0.03261513453989394),
  ('ordotempli', 0.027955829605623376),
  ('spinoff', 0.0244874366533497)],
 104: [('migraine', 0.16223990073890543),
  ('pain', 0.09834838866464185),
  ('headache', 0.056667075255279656),
  ('exercise', 0.04143613255490441),
  ('gordonbank', 0.038491299707130326),
  ('analgesic', 0.03630595257654058),
  ('patient', 0.030063306714025125),
  ('leg', 0.02804293186847255),
  ('tennis', 0.026679497611496007),
  ('dn', 0.026679497611496007)],
 105: [('wire', 0.07834341561141636),
  ('ground', 0.0662355416562859),
  ('wiring', 0.06547330376509235),
  ('outlet', 0.056675975611596804),
  ('neutral', 0.0563862526847771),
  ('circuit', 0.047885800899032015),
  ('gfci', 0.03532001646091766),
  ('breaker', 0.02882751075835729),
  ('panel', 0.027437407192693306),
  ('electrical', 0.02704972616731253)],
 106: [('ch', 0.1072145669907265),
  ('aspect', 0.08944348470821777),
  ('group', 0.08676093860072716),
  ('splitpersonally', 0.07368947430042457),
  ('wate', 0.0688543790160629),
  ('graphic', 0.05962494805940736),
  ('convenience', 0.05488491745866772),
  ('forum', 0.05068458234483233),
  ('michaelnerone', 0.048717956618850373),
  ('favor', 0.04303065330717183)],
 107: [('sharedmemory', 0.09984862267196293),
  ('server', 0.08077634616320065),
  ('animation', 0.07149458451622395),
  ('xputimage', 0.06938048157902864),
  ('pixmap', 0.06470934452021422),
  ('segment', 0.040227316283342356),
  ('client', 0.03982257187611942),
  ('extension', 0.03869337527026248),
  ('xview', 0.038520568497325824),
  ('sunview', 0.03603883125581173)],
 108: [('line', 0.12278415264632199),
  ('calibra', 0.12117262298349488),
  ('hoi', 0.1119321674227654),
  ('nunnery', 0.10652747265778488),
  ('spec', 0.10244445087130769),
  ('oakland', 0.09971947718315788),
  ('netlander', 0.09971947718315788),
  ('thee', 0.09345728011382798),
  ('fli', 0.09188834529403005),
  ('crush', 0.07939914479165967)],
 109: [('video', 0.13384789737590141),
  ('tape', 0.06879519563935993),
  ('vcr', 0.06603722014903728),
  ('tv', 0.04886008328086996),
  ('copy', 0.044028453454146355),
  ('quicktime', 0.04203231102340678),
  ('react', 0.040761481634967164),
  ('protection', 0.035972707943782004),
  ('frame', 0.03378836666370587),
  ('ntsc', 0.030754108677866147)],
 110: [('fifthamendment', 0.06262115616779944),
  ('password', 0.06102458469881448),
  ('key', 0.05188237823383673),
  ('compel', 0.03443518674786847),
  ('disclosure', 0.031964251785412505),
  ('copyright', 0.030679417273406895),
  ('private', 0.028556812483467545),
  ('peanutsstrip', 0.02763921285751134),
  ('keyphrase', 0.02763921285751134),
  ('reveal', 0.02700120333854981)],
 111: [('wheelie', 0.2354158180022329),
  ('shaft', 0.19715198950698468),
  ('shaftdrive', 0.10323000158279452),
  ('motorcycle', 0.06055770780867878),
  ('grind', 0.05463837647054193),
  ('splitfire', 0.05428070890125142),
  ('frontwheel', 0.04832837234959504),
  ('clutch', 0.0469165653787704),
  ('effect', 0.045750525229521215),
  ('imposible', 0.04320892771303779)],
 112: [('drink', 0.11303233811805648),
  ('ride', 0.10884622443419721),
  ('alcohol', 0.054914861941991654),
  ('drinking', 0.053539043362862346),
  ('drinktonight', 0.041722000050816055),
  ('cyclingcouple', 0.041722000050816055),
  ('sobriety', 0.04102158272054912),
  ('drunk', 0.04098137921906867),
  ('hour', 0.040119721154805964),
  ('drinkshour', 0.03722832752236221)],
 113: [('abortion', 0.09125053945368067),
  ('fetus', 0.04677938990132528),
  ('child', 0.036969498865850325),
  ('human', 0.03351139557917321),
  ('parent', 0.03202875352752832),
  ('larrymargoli', 0.026112614209208993),
  ('premium', 0.025905144275951048),
  ('life', 0.025518196299452102),
  ('coverage', 0.024623482397489023),
  ('womb', 0.02455463786058787)],
 114: [('duo', 0.0777230040968972),
  ('problem', 0.05560793027207993),
  ('freeze', 0.04735561096841055),
  ('apple', 0.044727514705815145),
  ('sleep', 0.03935526181262343),
  ('reboot', 0.03432626483969979),
  ('occasionally', 0.02862217330951077),
  ('reset', 0.027830694997456967),
  ('pram', 0.0270751472889907),
  ('software', 0.027004449927072682)],
 115: [('nickpettefar', 0.10167583903310827),
  ('uknewsreader', 0.07767086531116178),
  ('ltdmaidenhead', 0.07767086531116178),
  ('unitedkingdom', 0.0731611909064114),
  ('tinversion', 0.06943656784302209),
  ('incarcerate', 0.06906360985559767),
  ('bnrmaidenhead', 0.05220928709888071),
  ('pettefarcurrently', 0.0455149091463448),
  ('conciseoxford', 0.038651052295653535),
  ('gmtwibble', 0.03729234792777193)],
 116: [('ancient', 0.05578348760706917),
  ('document', 0.03051717672089392),
  ('medievalperiod', 0.029333529627255744),
  ('book', 0.026916162822524257),
  ('lewis', 0.02680721760550181),
  ('mystery', 0.025954293114076647),
  ('copy', 0.025012742208721062),
  ('harrison', 0.024971773423624208),
  ('rhetoric', 0.02428126890419976),
  ('argument', 0.022949332444613446)],
 117: [('odometer', 0.1430687919288719),
  ('mileage', 0.053786130836405655),
  ('car', 0.04743518934987012),
  ('electronicodometer', 0.039769055367586376),
  ('sensor', 0.038766876055023956),
  ('reading', 0.032094102259624877),
  ('dealer', 0.027127906414111093),
  ('pulse', 0.026950847135860604),
  ('mile', 0.025652162021292002),
  ('oxygensensor', 0.024438389705859053)],
 118: [('lanworkplace', 0.055879838447086036),
  ('os', 0.04575337853620259),
  ('chicogo', 0.036826423862716805),
  ('window', 0.034611026569177736),
  ('do', 0.0334730383691935),
  ('client', 0.026560616655507435),
  ('app', 0.025958148909901703),
  ('seperate', 0.025934283700103578),
  ('multithreade', 0.025557801031113814),
  ('wfwg', 0.025557801031113814)],
 119: [('moa', 0.10087376321944605),
  ('bmwmoa', 0.09668905379783674),
  ('member', 0.041187025869505844),
  ('membership', 0.041038032666841306),
  ('politic', 0.040259792367748994),
  ('davidkarr', 0.0333549622253739),
  ('humor', 0.03296744690146493),
  ('lapse', 0.032577563507232545),
  ('signing', 0.030299368060697662),
  ('barnacle', 0.030263486550507066)],
 120: [('weight', 0.12518093688110332),
  ('diet', 0.08797093208120046),
  ('cycle', 0.08295989320790818),
  ('chuckforsberg', 0.08078482511331075),
  ('wakgx', 0.07090846929531068),
  ('obesity', 0.06762644803201298),
  ('gordonbank', 0.059294400993931975),
  ('obesityresearcher', 0.042177758349034644),
  ('oraltradition', 0.04128736684529927),
  ('weightgain', 0.0377130213071537)],
 121: [('serb', 0.057345747872705743),
  ('moslem', 0.048472229348799516),
  ('ethniccleanse', 0.040037630019952176),
  ('serbiangenocide', 0.039787867598129224),
  ('god', 0.0357365414271381),
  ('work', 0.02849615524022826),
  ('war', 0.026333625253330283),
  ('bosnia', 0.02414939470447389),
  ('judgement', 0.023205352044012733),
  ('uncomfortable', 0.023062855348442823)],
 122: [('eye', 0.11380356836595809),
  ('handedness', 0.09413049245131667),
  ('eyedominance', 0.09057673625080101),
  ('rk', 0.06846916705285444),
  ('contactlense', 0.05445438951105747),
  ('eyedness', 0.05014944959845277),
  ('prk', 0.04921271019033068),
  ('dominant', 0.046801778927292724),
  ('lenscorrection', 0.0439122499160461),
  ('richardsilver', 0.040267889739326615)],
 123: [('americanoccupied', 0.12106428849800348),
  ('failedpresident', 0.11308440106817347),
  ('replacedjimmy', 0.11308440106817347),
  ('redundancydepartment', 0.11308440106817347),
  ('georgebush', 0.10938243414394261),
  ('carter', 0.10626976966564901),
  ('tenyear', 0.09662449018769294),
  ('opinion', 0.08586687313568053),
  ('employer', 0.07811672377092033),
  ('standard', 0.05086016147619553)],
 124: [('hiramcollege', 0.12709669737018062),
  ('package', 0.07737522749068376),
  ('voucher', 0.05944740488508054),
  ('sale', 0.05908997759864061),
  ('vhsmovie', 0.056946470671591594),
  ('wovie', 0.0549446256029223),
  ('dance', 0.051255391584154805),
  ('douglaskou', 0.04986391021218447),
  ('hirambhiram', 0.04986391021218447),
  ('beta', 0.04968676131750533)],
 125: [('lyme', 0.1591425594615977),
  ('treat', 0.057979370142165254),
  ('physician', 0.05550517572770537),
  ('patient', 0.051981382544939995),
  ('gordonbank', 0.0517641616751063),
  ('lymedisease', 0.050648132132180786),
  ('poo', 0.045661081869238264),
  ('diagnose', 0.044713724297398796),
  ('culture', 0.03805944843235583),
  ('ld', 0.0333173745606279)],
 126: [('selectiveservice', 0.058882392902826),
  ('securityadmistration', 0.03999920099972593),
  ('drafteesfinally', 0.03999920099972593),
  ('volunteerarmy', 0.03999920099972593),
  ('utterwaste', 0.03999920099972593),
  ('irssocial', 0.03999920099972593),
  ('naval', 0.03927264178220903),
  ('abolish', 0.03838667193226359),
  ('motorvehicle', 0.03747061366543473),
  ('agree', 0.03728230977917745)],
 127: [('virtualreality', 0.04401526794244309),
  ('client', 0.03365439365099483),
  ('diaspar', 0.029042988414331828),
  ('model', 0.028129811990830908),
  ('svr', 0.02699920017358929),
  ('multiverse', 0.023775083808681603),
  ('operation', 0.023255996477202816),
  ('object', 0.021599140254575554),
  ('virtual', 0.0211917038262767),
  ('provide', 0.020875158163656135)],
 128: [('image', 0.07166180530040553),
  ('sphinx', 0.0644993041215123),
  ('spect', 0.042642104994902494),
  ('imageprocessing', 0.03733140853822152),
  ('imaging', 0.03519547851063035),
  ('package', 0.03179196243899167),
  ('input', 0.0298535540411922),
  ('signal', 0.029087717143647634),
  ('analysis', 0.028777698562541757),
  ('aprs', 0.02843711326515909)],
 129: [('lucifer', 0.09571104389245019),
  ('logically', 0.06453763651234762),
  ('evil', 0.05820482385172957),
  ('therefore', 0.05712578677781744),
  ('jehovahswitnesse', 0.05125617767725834),
  ('mercede', 0.04938418590548362),
  ('syllogism', 0.04057307088256163),
  ('free', 0.03814360032646522),
  ('omniscient', 0.03706224173802248),
  ('omc', 0.036821289458618976)],
 130: [('environment', 0.06665621687065498),
  ('command', 0.06593140574756158),
  ('file', 0.06265146583030211),
  ('bat', 0.05819599076331669),
  ('exitcode', 0.05555902686078422),
  ('do', 0.04630539952526526),
  ('set', 0.04296906648423895),
  ('appdefault', 0.040955979172196635),
  ('window', 0.03970971100349082),
  ('pif', 0.039698858280018075)],
 131: [('exposeevent', 0.12341406333172053),
  ('handler', 0.09663174065448546),
  ('rectangle', 0.07259098588229142),
  ('item', 0.06556231605807203),
  ('window', 0.05871585094282753),
  ('draw', 0.051805528284627485),
  ('map', 0.05071530366197182),
  ('button', 0.048845880084468433),
  ('mapped', 0.043477392391006446),
  ('callxcopyarea', 0.042665280051964946)],
 132: [('gateway', 0.07681549515691719),
  ('tape', 0.02632550132810915),
  ('service', 0.02364439045101329),
  ('lbl', 0.0232673726424468),
  ('dealer', 0.023085838125324262),
  ('order', 0.021066926486735398),
  ('peer', 0.018378204064274303),
  ('wawbu', 0.018166188035808283),
  ('retail', 0.018106335538132335),
  ('controller', 0.01780075968506968)],
 133: [('easter', 0.13987951466947263),
  ('resurrection', 0.08160773944931027),
  ('celebration', 0.07600518365138578),
  ('celebrate', 0.07205434146307106),
  ('ishtar', 0.061595196020180584),
  ('word', 0.047048690869164606),
  ('objection', 0.03828110426347549),
  ('french', 0.034013723094299717),
  ('pagangoddess', 0.0317941272086405),
  ('name', 0.02927122632751491)],
 134: [('holocaustmemorial', 0.13057628797003842),
  ('dangerousmistake', 0.12504625376964829),
  ('museumcostly', 0.11486088672119268),
  ('monument', 0.04932016069434463),
  ('tax', 0.04579647000925824),
  ('federal', 0.04098517393110056),
  ('exmpt', 0.04038315760657041),
  ('jackschmidle', 0.03948829066409946),
  ('educate', 0.03755267847702901),
  ('private', 0.035822455054855186)],
 135: [('blast', 0.071927616987391),
  ('properequipment', 0.06905673360850827),
  ('batf', 0.06662798298797837),
  ('compound', 0.05936350629959883),
  ('megafire', 0.04923606336780554),
  ('goodfoke', 0.04431245703102498),
  ('protect', 0.041845145903825744),
  ('armoredtransport', 0.040568805869328935),
  ('country', 0.0380209559219227),
  ('wod', 0.0379060294971259)],
 136: [('interrupt', 0.18878557592779646),
  ('port', 0.14286143551042427),
  ('com', 0.13556722563829707),
  ('mouse', 0.06182362257397565),
  ('modem', 0.04837481781132188),
  ('card', 0.04402969065379614),
  ('serialport', 0.04379354961948205),
  ('conflict', 0.04342578832399484),
  ('printer', 0.037099095809735666),
  ('pc', 0.031687844479500946)],
 137: [('adl', 0.1294064044322726),
  ('spy', 0.0691654920241457),
  ('aren', 0.058326283516843),
  ('gerard', 0.043796214653130675),
  ('police', 0.03609988324778664),
  ('yigal', 0.0358353198579558),
  ('information', 0.033734992140672625),
  ('consideryigal', 0.03097768986891098),
  ('yigalaren', 0.027184802858906624),
  ('confidential', 0.027184165297046307)],
 138: [('motherboard', 0.12567793484688464),
  ('slot', 0.06593291009958513),
  ('card', 0.061838932791374174),
  ('micronic', 0.04302984913982701),
  ('magstripe', 0.03178263286732731),
  ('magnetic', 0.02959496706803388),
  ('case', 0.028608629884068457),
  ('powersupply', 0.02594117764239604),
  ('chassis', 0.025817909483896208),
  ('micron', 0.02558250809055905)],
 139: [('serialport', 0.17334336446373577),
  ('serial', 0.08482946681218848),
  ('device', 0.07492807245872372),
  ('port', 0.0700109767811955),
  ('connect', 0.04944489179575653),
  ('simultaneously', 0.04497395706199964),
  ('printer', 0.043509753391128114),
  ('swii', 0.040081406936457516),
  ('working', 0.03931854958151048),
  ('modem', 0.038295382528227055)],
 140: [('comicstrip', 0.10522537805584628),
  ('copy', 0.07830774935773173),
  ('appear', 0.07760394526187306),
  ('annual', 0.07535806879161107),
  ('cover', 0.07398160888926149),
  ('wolverine', 0.060943595258260166),
  ('newmutant', 0.05597956900201285),
  ('art', 0.04990145439451255),
  ('mcfarlane', 0.04464030785252035),
  ('punisher', 0.04464030785252035)],
 141: [('bulb', 0.10309323978340915),
  ('uvlight', 0.06520030218450448),
  ('brightness', 0.0643678652100585),
  ('uv', 0.06357469121693145),
  ('string', 0.05585920146847542),
  ('blinker', 0.05220010570696789),
  ('uvflashlight', 0.044438198504159525),
  ('fluorescent', 0.0407501888653153),
  ('cuetape', 0.03752492747794877),
  ('glow', 0.037115466052533595)]}

I can view all the topics discussed in all of the documents.

Topic Visualization¶

Visualizing BERTopic and its derivatives is important in understanding the model, how it works, but more importantly, where it works. Since topic modeling can be quite a subjective field it is difficult for users to validate their models. Looking at the topics and seeing if they make sense is an important factor in alleviating this issue.

Intertopic Distance Map¶
In [32]:
model2.visualize_topics()
Topic Hierarchy¶
In [33]:
model2.visualize_hierarchy()
Visualize Terms¶
In [34]:
model2.visualize_barchart()

Information Retrieval Using SentenceTransformer¶

In [78]:
query=input('Enter the query here :')
query_embedding = model.encode(query)
Enter the query here :What is your view regarding gun control? 
In [80]:
top_k=5
cos_scores = util.cos_sim(query_embedding, embeddings)[0]
cos_scores = cos_scores.cpu()

# using torch.topk to find the highest 5 scores
top_results = torch.topk(cos_scores, k=top_k)

print("\n\n======================\n\n")
print("Query:", query)
print("\nTop 5 Most Similar Sentences in the Corpus:\n")

for score, idx in zip(top_results[0], top_results[1]):
    print(df['text_cleaned'].values[idx], "(Score: %.4f)" % (score))

======================


Query: What is your view regarding gun control? 

Top 5 Most Similar Sentences in the Corpus:

threaten gun_owner line nntp_poste host m article write m write story future gun_control point welcome opinion wonderful resource newsgroup take advantage thank advance feedback believe serious threat gun_owner future government liberal dea see concerned ammendment reinterpret apply armed_force bar civilian own arm kind well contribution taxis abortion elimination fetal tissue happen control type arm people allow buy type feel compel restrict military use hydrogen bomb perhaps describe hci gun_control activist determine make illegal civilian firearm personally read brady_bill entirety thank know truth truth make free  (Score: 0.5477)
gun_control mad tv news nntp_poste organization university line article steve_mane write know state gun_control effect homicide_rate think argue effect effect also consider negative side law_abide citizen armed pistol part prevent national crime year extreme study find number crime homicide private ownership firearm approximately live year roughly criminal homicide fatal accident involve gun year net benefit show gun_control measure disarm criminal currently use gun hard accord federal batf criminal buy gun counter gun_control law nature effect legal sale law remove benefit arm law_abide citizen minimal effect armed criminal large get gun illegally sound net benefit license weapon licensed weapon assume support reasonable law waiting_period background_check license complete ban alter statistic refer assume s support way people die fall stair accidental handgun death significant next household accident american child accidentally shoot child last year handgun_homicide child age die drown drink poisonous household chemical drano fall real goal reduce tragic accidental_death child ban drain cleaner well palce start perhaps restrict ownership professional plumber please dictionary argument rate total number re offer emphasis comparison call emphasis refer completely statistic sentence comparison valid put number together convince people right kind thing call propaganda cu_boulder  (Score: 0.5366)
ban firearm life health_science line article paul_prescod write drug ban tell supply dry drug easy manufacture easy smuggle easy hide comparison ignorant fool know drug business gun business editor freedom network international society market fax think universally act selfishly  (Score: 0.5252)
gun law organization canadian moderator nice summary thank talk federal try clarify bunch thing regard change canadian gun law post informational purpose question email followup still technically feasible almost impossible get tell still legal lethal force protect life also contrary officer tell gun store lock unload however regard capacity magazine still clear exempt manage province general idea exempt person receive letter form authorize possess high capacity magazine apparently authorization specify many prohibit weapon allow possess dealer allow order high capacity mag allow possess allow stock high capacity magazine convert comply new limit consider prohibit weapon amendment regulation specify possible method alter marketing reduce capacity magazine know much charge cover discuss type memory take gospel lawyer refuse play tv ofah frontenac club  (Score: 0.4903)
gun backcountry thank university line article write wrong whole gun protection mindset ignore systemic effect cumulative individual action want fire insurance house s prudent effect bunch paranoid pack handgun backcountry make else choose protect manner pretty king nervous re threat re affect mean take logical conclusion suppose carry handgun time protection people carry handgun collectively feel safe hell d feel lot insecure note available psych info say feeling security increase victimization stat say increase rational systemic effect good people protect bad people go modify behavior response re go much itchier much willing kill people course routine mugging think happen instead switch change behavior property crime s improvement even economic take unchanged sure switch kill  (Score: 0.4898)

Using the torch package combined with the sentence transformer from the BERTopic model I can infer the question with the topmost similar sentences in the corpus and evaluate each sentence with respect to the provided query.

That is all for the implementation of those topic models. Each model has a very different approach, and the one to choose from depends greatly on the problem one is trying to solve.

Thank You